A Few More Data Wrangling Tools

STAT 331

Peer Code Review

ggplot(data = surveys, mapping = aes(x=hindfoot_length,y= weight)) +  
  geom_jitter(alpha=.2,color='tomato')+ facet_wrap(~species)+geom_boxplot(outlier.shape = NA)+labs(
    title ='Weight to hindfoot comparison'
  )+ xlab('length (mm)')+ylab('Weight(g)')



What feedback would you give?

Defining Grades in 331


  • A: Superior Attainment of Course Objectives

  • B: Good Attainment of Course Objectives

  • C: Acceptable Attainment of Course Objectives

  • D: Poor Attainment of Course Objectives

A few words about drop_na()

  • Easy tool to remove missing values
  • By default, removes any row with a missing value in any variable
  • But you can specify which columns it should check for missing values!
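
As a minimal sketch of the difference between the two behaviors, the example below uses a small made-up tibble (pets and its columns are hypothetical, not course data) and compares drop_na() with and without a column specified.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two rows have a missing value, in different columns
pets <- tibble(
  name   = c("Ada", "Bo", "Cy"),
  weight = c(4.2, NA, 6.1),
  age    = c(3, 5, NA)
)

# Default: drops every row that has an NA in any column
all_complete <- pets |> 
  drop_na()

# Only check weight: the row with a missing age is kept
weight_complete <- pets |> 
  drop_na(weight)
```

Here all_complete keeps only the one fully observed row, while weight_complete keeps the two rows where weight is not missing.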

Summarizing Frequencies

count() – counts the values of one or more categorical variables

starwars |> 
  count(homeworld)

Setting sort = TRUE sorts the resulting tibble in descending order of the counts (the n column).

starwars |> 
  count(homeworld, 
        sort = TRUE)
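
count() also accepts more than one variable, in which case it counts each combination of values. The sketch below assumes a small made-up tibble (critters is hypothetical) so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data with two categorical variables
critters <- tibble(
  species = c("cat", "cat", "dog", "dog", "dog"),
  sex     = c("F", "M", "F", "F", "M")
)

# One row per species / sex combination, most frequent first
combo_counts <- critters |> 
  count(species, sex, 
        sort = TRUE)
```

The counts are stored in a column named n, and they add up to the number of rows in the original data.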

Finding Unique Groups

distinct() – selects the unique / distinct rows from a dataset

Arguments

  • ... – variables to use when determining uniqueness
    • can use multiple!
  • .keep_all – decides if all of the columns should be kept
    • FALSE is default!
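
The arguments above can be sketched on a small made-up tibble (visits and its columns are hypothetical), showing one variable, multiple variables, and .keep_all = TRUE.

```r
library(dplyr)

# Hypothetical toy data: site "A" appears twice in the same year
visits <- tibble(
  site  = c("A", "A", "B", "B"),
  year  = c(2020, 2020, 2020, 2021),
  count = c(3, 7, 2, 5)
)

# Unique sites; only the site column is returned
sites <- visits |> 
  distinct(site)

# Unique site / year combinations
site_years <- visits |> 
  distinct(site, year)

# Unique sites, keeping every column from the first row found for each site
first_per_site <- visits |> 
  distinct(site, .keep_all = TRUE)
```

With .keep_all = TRUE, the other columns come from the first row of each distinct group, so the row order of the data matters.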

Discretizing Variables

  • if_else()
    • Useful when there are two options
  • case_when()
    • Useful when there are three or more options
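
A minimal sketch of both functions inside mutate(), assuming a made-up tibble (animals) and arbitrary cutoffs chosen only for illustration:

```r
library(dplyr)

# Hypothetical toy data, including one missing weight
animals <- tibble(weight = c(5, 40, 120, NA))

sized <- animals |> 
  mutate(
    # Two options: if_else()
    heavy = if_else(weight > 50, "heavy", "light"),
    # Three or more options: case_when(), conditions are checked in order
    size = case_when(
      weight < 10   ~ "small",
      weight < 100  ~ "medium",
      weight >= 100 ~ "large"
    )
  )
```

A missing weight stays missing in both new columns: if_else() returns NA when its condition is NA, and case_when() returns NA when no condition is TRUE.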

What if I want to perform the same operation across multiple columns?

across()

makes it easy to apply the same transformation to multiple columns, allowing you to use select() semantics inside “data-masking” functions like summarise() and mutate()


across(.cols = everything(), .fns = NULL, ...)

Summarizing Multiple Columns

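starwars |> 
  summarise(
    across(
      height:mass, 
      # anonymous function instead of passing na.rm through `...`,
      # which dplyr 1.1.0 and later deprecate
      \(x) mean(x, na.rm = TRUE)
      )
    )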
# A tibble: 1 × 2
  height  mass
   <dbl> <dbl>
1   174.  97.3

Conditional Summarizing

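starwars |> 
  summarise(
    across(
      where(is.numeric), 
      # anonymous function instead of passing na.rm through `...`,
      # which dplyr 1.1.0 and later deprecate
      \(x) mean(x, na.rm = TRUE)
      )
    )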
# A tibble: 1 × 3
  height  mass birth_year
   <dbl> <dbl>      <dbl>
1   174.  97.3       87.6
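
across() works the same way inside mutate(), where it transforms columns rather than summarizing them. The sketch below assumes a made-up tibble (measurements and its column names are hypothetical) and converts two columns from centimetres to metres.

```r
library(dplyr)

# Hypothetical toy data measured in centimetres
measurements <- tibble(
  id        = c("a", "b"),
  height_cm = c(150, 200),
  width_cm  = c(30, 45)
)

# Apply the same conversion to every column from height_cm through width_cm
in_metres <- measurements |> 
  mutate(
    across(
      height_cm:width_cm, 
      \(x) x / 100
      )
    )
```

The selected columns are overwritten in place and keep their original names; the id column is untouched.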

❤️ |>

starwars |> 
  drop_na(homeworld) |> 
  filter(gender == "feminine") |>
  ggplot(mapping = aes(y = homeworld, fill = homeworld)) + 
  geom_bar(position = "dodge") + 
  labs(title = "Homeworlds of Feminine Starwars Characters", 
       y = "") + 
  theme(legend.position = "none", 
        plot.title = element_text(size = 28), 
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20), 
        axis.title.x = element_text(size = 24)
        )


If you use a |> instead of a + to add a ggplot layer:

Error in `validate_mapping()`:
Did you use %>% instead of +?

Implications of Data Ethics

Data Science Oath

I will not be ashamed to say, “I know not,” nor will I fail to call in my colleagues when the skills of another are needed for solving a problem.

I will respect the privacy of my data subjects, for their data are not disclosed to me that the world may know, so I will tread with care in matters of privacy and security.

I will remember that my data are not just numbers without meaning or context, but represent real people and situations, and that my work may lead to unintended societal consequences, such as inequality, poverty, and disparities due to algorithmic bias.

ASA Ethical Guidelines

The American Statistical Association’s Ethical Guidelines for Statistical Practice are intended to help statistics practitioners make decisions ethically. Additionally, the ethical guidelines aim to promote accountability by informing those who rely on statistical analysis of the standards they should expect.

Institutional Review Board

IRB reviews help to ensure that research participants are protected from research-related risks and treated ethically, a necessary prerequisite for maintaining the public’s trust in the research enterprise and allowing science to advance for the common good.

Note

Watch a video about IRB to learn more.